AITopics | Sea of Marmara

Collaborating Authors

Sea of Marmara

RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

arXiv.org Artificial IntelligenceOct-15-2025

High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at https://github.com/cxcscmu/RePro.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2510.10681

Country:

Asia > Middle East (0.28)
North America > United States (0.28)
Atlantic Ocean > Mediterranean Sea > Aegean Sea > Sea of Marmara (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Government > Military (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

The Illusion of Rights based AI Regulation

Mei, Yiyang, Sag, Matthew

arXiv.org Artificial IntelligenceFeb-26-2025

Whether and how to regulate AI is one of the defining questions of our times - a question that is being debated locally, nationally, and internationally. We argue that much of this debate is proceeding on a false premise. Specifically, our article challenges the prevailing academic consensus that the European Union's AI regulatory framework is fundamentally rights-driven and the correlative presumption that other rights-regarding nations should therefore follow Europe's lead in AI regulation. Rather than taking rights language in EU rules and regulations at face value, we show how EU AI regulation is the logical outgrowth of a particular cultural, political, and historical context. We show that although instruments like the General Data Protection Regulation (GDPR) and the AI Act invoke the language of fundamental rights, these rights are instrumentalized - used as rhetorical cover for governance tools that address systemic risks and maintain institutional stability. As such, we reject claims that the EU's regulatory framework and the substance of its rules should be adopted as universal imperatives and transplanted to other liberal democracies. To add weight to our argument from historical context, we conduct a comparative analysis of AI regulation in five contested domains: data privacy, cybersecurity, healthcare, labor, and misinformation. This EU-US comparison shows that the EU's regulatory architecture is not meaningfully rights-based. Our article's key intervention in AI policy debates is not to suggest that the current American regulatory model is necessarily preferable but that the presumed legitimacy of the EU's AI regulatory approach must be abandoned.

llusion, regulation, rights, (14 more...)

arXiv.org Artificial Intelligence

2503.05784

Country:

North America > United States > New York (0.04)
North America > United States > Illinois (0.04)
Europe > Russia (0.04)
(29 more...)

Genre:

Overview (0.92)
Research Report (0.82)

Industry:

Law > Statutes (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Health Care Providers & Services > Reimbursement (1.00)
(3 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)

Add feedback

Regional Ocean Forecasting with Hierarchical Graph Neural Networks

Holmberg, Daniel, Clementi, Emanuela, Roos, Teemu

arXiv.org Artificial IntelligenceNov-20-2024

Accurate ocean forecasting systems are vital for understanding marine dynamics, which play a crucial role in environmental management and climate adaptation strategies. Traditional numerical solvers, while effective, are computationally expensive and time-consuming. Recent advancements in machine learning have revolutionized weather forecasting, offering fast and energy-efficient alternatives. Building on these advancements, we introduce SeaCast, a neural network designed for high-resolution, medium-range ocean forecasting. SeaCast employs a graph-based framework to effectively handle the complex geometry of ocean grids and integrates external forcing data tailored to the regional ocean context. Our approach is validated through experiments at a high spatial resolution using the operational numerical model of the Mediterranean Sea provided by the Copernicus Marine Service, along with both numerical and data-driven atmospheric forcings.

forcing, forecast, seacast, (15 more...)

arXiv.org Artificial Intelligence

2410.11807

Country:

Europe > Finland > Uusimaa > Helsinki (0.04)
Europe > Gibraltar (0.04)
Atlantic Ocean > Mediterranean Sea > Aegean Sea > Sea of Marmara > Dardanelles (0.04)
(13 more...)

Genre: Research Report (1.00)

Industry: Transportation > Marine (0.34)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models

Chen, Yuyan, Wu, Chenwei, Yan, Songzhou, Liu, Panjun, Zhou, Haoyu, Xiao, Yanghua

arXiv.org Artificial IntelligenceAug-20-2024

Teachers are important to imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs' capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in teaching remains largely unexplored. In teaching, questioning is a key skill that guides students to analyze, evaluate, and synthesize core concepts and principles. Therefore, our research introduces a benchmark to evaluate the questioning capability in education as a teacher of LLMs through evaluating their generated educational questions, utilizing Anderson and Krathwohl's taxonomy across general, monodisciplinary, and interdisciplinary domains. We shift the focus from LLMs as learners to LLMs as educators, assessing their teaching capability through guiding them to generate questions. We apply four metrics, including relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs' outputs. Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses; Claude2 appears more apt as an interdisciplinary teacher. Furthermore, the automatic scores align with human perspectives.

huxley, opération, python 3, (13 more...)

arXiv.org Artificial Intelligence

2408.10947

Country:

Africa > Middle East > Egypt (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
(23 more...)

Genre:

Research Report > New Finding (0.47)
Personal > Interview (0.46)
Instructional Material > Course Syllabus & Notes (0.34)

Industry:

Materials > Chemicals > Industrial Gases (1.00)
Education > Assessment & Standards (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)

Add feedback

IBB Traffic Graph Data: Benchmarking and Road Traffic Prediction Model

Olug, Eren, Kaya, Kiymet, Tugay, Resul, Oguducu, Sule Gunduz

arXiv.org Artificial IntelligenceAug-2-2024

Road traffic congestion prediction is a crucial component of intelligent transportation systems, since it enables proactive traffic management, enhances suburban experience, reduces environmental impact, and improves overall safety and efficiency. Although there are several public datasets, especially for metropolitan areas, these datasets may not be applicable to practical scenarios due to insufficiency in the scale of data (i.e. number of sensors and road links) and several external factors like different characteristics of the target area such as urban, highways and the data collection location. To address this, this paper introduces a novel IBB Traffic graph dataset as an alternative benchmark dataset to mitigate these limitations and enrich the literature with new geographical characteristics. IBB Traffic graph dataset covers the sensor data collected at 2451 distinct locations. Moreover, we propose a novel Road Traffic Prediction Model that strengthens temporal links through feature engineering, node embedding with GLEE to represent inter-related relationships within the traffic network, and traffic prediction with ExtraTrees. The results indicate that the proposed model consistently outperforms the baseline models, demonstrating an average accuracy improvement of 4%.

dataset, node, prediction, (12 more...)

arXiv.org Artificial Intelligence

2408.01016

Country:

Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.06)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.06)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
(2 more...)

Genre: Research Report (0.83)

Industry: Transportation > Infrastructure & Services (0.35)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Halu-J: Critique-Based Hallucination Judge

Wang, Binjie, Chern, Steffi, Chern, Ethan, Liu, Pengfei

arXiv.org Artificial IntelligenceJul-17-2024

Large language models (LLMs) frequently generate non-factual content, known as hallucinations. Existing retrieval-augmented-based hallucination detection approaches typically address this by framing it as a classification task, evaluating hallucinations based on their consistency with retrieved evidence. However, this approach usually lacks detailed explanations for these evaluations and does not assess the reliability of these explanations. Furthermore, deficiencies in retrieval systems can lead to irrelevant or partially relevant evidence retrieval, impairing the detection process. Moreover, while real-world hallucination detection requires analyzing multiple pieces of evidence, current systems usually treat all evidence uniformly without considering its relevance to the content. To address these challenges, we introduce Halu-J, a critique-based hallucination judge with 7 billion parameters. Halu-J enhances hallucination detection by selecting pertinent evidence and providing detailed critiques. Our experiments indicate that Halu-J outperforms GPT-4o in multiple-evidence hallucination detection and matches its capability in critique generation and evidence selection. We also introduce ME-FEVER, a new dataset designed for multiple-evidence hallucination detection. Our code and dataset can be found in https://github.com/GAIR-NLP/factool .

aegean sea, critique, information, (15 more...)

arXiv.org Artificial Intelligence

2407.12943

Country:

Atlantic Ocean > Black Sea (0.06)
Atlantic Ocean > Mediterranean Sea > Aegean Sea > Sea of Marmara > Dardanelles (0.04)
Asia > Middle East > Republic of Türkiye (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry:

Media > Television (1.00)
Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation

Dong, Guanting, Zhu, Yutao, Zhang, Chenghao, Wang, Zechen, Dou, Zhicheng, Wen, Ji-Rong

arXiv.org Artificial IntelligenceJun-26-2024

Retrieval-augmented generation (RAG) has demonstrated effectiveness in mitigating the hallucination problem of large language models (LLMs). However, the difficulty of aligning the retriever with the diverse LLMs' knowledge preferences inevitably poses an inevitable challenge in developing a reliable RAG system. To address this issue, we propose DPA-RAG, a universal framework designed to align diverse knowledge preferences within RAG systems. Specifically, we initially introduce a preference knowledge construction pipline and incorporate five novel query augmentation strategies to alleviate preference data scarcity. Based on preference data, DPA-RAG accomplishes both external and internal preference alignment: 1) It jointly integrate pair-wise, point-wise, and contrastive preference alignment abilities into the reranker, achieving external preference alignment among RAG components. 2) It further introduces a pre-aligned stage before vanilla Supervised Fine-tuning (SFT), enabling LLMs to implicitly capture knowledge aligned with their reasoning preferences, achieving LLMs' internal alignment. Experimental results across four knowledge-intensive QA datasets demonstrate that DPA-RAG outperforms all baselines and seamlessly integrates both black-box and open-sourced LLM readers. Further qualitative analysis and discussions also provide empirical guidance for achieving reliable RAG systems. Our code is publicly available at https://github.com/dongguanting/DPA-RAG.

knowledge, language model, llm, (14 more...)

arXiv.org Artificial Intelligence

2406.18676

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.28)
North America > United States > Washington > King County > Seattle (0.14)
Africa > Tanzania (0.05)
(41 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Long, Lin, Wang, Rui, Xiao, Ruixuan, Zhao, Junbo, Ding, Xiao, Chen, Gang, Wang, Haobo

arXiv.org Artificial IntelligenceJun-14-2024

Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent advent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of real-world data with synthetic data generation. However, current investigations into this field lack a unified framework and mostly stay on the surface. Therefore, this paper provides an organization of relevant studies based on a generic workflow of synthetic data generation. By doing so, we highlight the gaps within existing research and outline prospective avenues for future study. This work aims to shepherd the academic and industrial communities towards deeper, more methodical inquiries into the capabilities and applications of LLMs-driven synthetic data generation.

computational linguistic, data generation, language model, (15 more...)

arXiv.org Artificial Intelligence

2406.15126

Country:

Asia > Singapore (0.05)
North America > Canada > Ontario > Toronto (0.04)
Atlantic Ocean > Mediterranean Sea > Aegean Sea > Sea of Marmara > Dardanelles (0.04)
(14 more...)

Genre:

Overview (1.00)
Research Report (0.64)

Industry:

Health & Medicine (0.67)
Media > Film (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

OXYGENERATOR: Reconstructing Global Ocean Deoxygenation Over a Century with Deep Learning

Lu, Bin, Zhao, Ze, Han, Luyu, Gan, Xiaoying, Zhou, Yuntao, Zhou, Lei, Fu, Luoyi, Wang, Xinbing, Zhou, Chenghu, Zhang, Jing

arXiv.org Artificial IntelligenceMay-12-2024

Accurately reconstructing the global ocean deoxygenation over a century is crucial for assessing and protecting marine ecosystem. Existing expert-dominated numerical simulations fail to catch up with the dynamic variation caused by global warming and human activities. Besides, due to the high-cost data collection, the historical observations are severely sparse, leading to big challenge for precise reconstruction. In this work, we propose OxyGenerator, the first deep learning based model, to reconstruct the global ocean deoxygenation from 1920 to 2023. Specifically, to address the heterogeneity across large temporal and spatial scales, we propose zoning-varying graph message-passing to capture the complex oceanographic correlations between missing values and sparse observations. Additionally, to further calibrate the uncertainty, we incorporate inductive bias from dissolved oxygen (DO) variations and chemical effects. Compared with in-situ DO observations, OxyGenerator significantly outperforms CMIP6 numerical simulations, reducing MAPE by 38.77%, demonstrating a promising potential to understand the "breathless ocean" in data-driven manner.

enerator, reconstructing global ocean deoxygenation, reconstruction, (9 more...)

arXiv.org Artificial Intelligence

2405.07233

Country:

North America > United States > District of Columbia > Washington (0.14)
Europe > Austria > Vienna (0.14)
Atlantic Ocean > Black Sea (0.05)
(26 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Government (0.46)
Energy (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A novel interface for adversarial trivia question-writing

Liu, Jason

arXiv.org Artificial IntelligenceMar-11-2024

A critical component when developing question-answering AIs is an adversarial dataset that challenges models to adapt to the complex syntax and reasoning underlying our natural language. Present techniques for procedurally generating adversarial texts are not robust enough for training on complex tasks such as answering multi-sentence trivia questions. We instead turn to human-generated data by introducing an interface for collecting adversarial human-written trivia questions. Our interface is aimed towards question writers and players of Quiz Bowl, a buzzer-based trivia competition where paragraph-long questions consist of a sequence of clues of decreasing difficulty. To incentivize usage, a suite of machine learning-based tools in our interface assist humans in writing questions that are more challenging to answer for Quiz Bowl players and computers alike. Not only does our interface gather training data for the groundbreaking Quiz Bowl AI project QANTA, but it is also a proof-of-concept of future adversarial data collection for question-answering systems. The results of performance-testing our interface with ten originally-composed questions indicate that, despite some flaws, our interface's novel question-writing features as well as its real-time exposure of useful responses from our machine models could facilitate and enhance the collection of adversarial questions. The code for our interface is available at: https://github.com/Zefan-Cai/QAML

interface, quiz bowl, reasoning, (14 more...)

arXiv.org Artificial Intelligence

2404.00011

Country:

South America > Paraguay (0.04)
North America > United States > Maryland (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
(11 more...)

Genre: Research Report (0.82)

Industry:

Education (0.48)
Leisure & Entertainment (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.76)

Add feedback